Abstract

In this paper we propose the Structured Deep Neural Network
(Structured DNN) as a structured and deep learning algorithm,
which learns to find the best structured object (such as a
label sequence) given a structured input (such as a vector
sequence) by globally considering the mapping relationships
between the structures rather than item by item. When automatic
speech recognition (ASR) is viewed as a special case of such a
structured learning problem, where the acoustic vector sequence
is the input and the phoneme label sequence is the output, it
becomes possible to learn comprehensively, utterance by
utterance as a whole, rather than frame by frame. Structured
Support Vector Machine (structured SVM) was previously proposed
to perform ASR with structured learning, but it is limited by
the linear nature of SVM. Here we propose structured DNN, which
uses nonlinear transformations in multiple layers, as a
structured and deep learning algorithm. It was shown to beat
structured SVM in preliminary experiments on TIMIT.

Index Terms: speech recognition, structured learning, deep 
neural network, structured deep neural network. 

1. Introduction 

Hidden Markov Models (HMMs) [?] have long been the most
successful approach to automatic speech recognition [?, ?].
With the maturity of machine learning, great efforts have been
made to integrate more machine learning concepts into the HMM
framework [?, ?], because HMMs are generative, while many
machine learning approaches can also be discriminative [?, ?].
Using Deep Neural Networks (DNN) [?, ?] with HMMs is a good
example [?, ?, ?]. In general, HMMs model the phoneme structure
by states and the transitions among them, but they are trained
primarily on the frame level, regardless of whether they are
based on DNN [?, ?] or Gaussian Mixture Models (or SGMM [?]).
Under the HMM framework [?], the hierarchical structure of an
utterance is taken care of by the HMMs and their states, the
lexicon and the language model, which are learned separately
from disjoint sets of knowledge sources. On the other hand, it
is well known that there may exist underlying overall
structures for the utterances behind the signals which may be
helpful to recognition. If we can learn such structures
comprehensively from the signals of the entire utterance
globally, the recognition scenario may be different.

Structured learning, which tries to learn the complicated
structures exhibited by the data, has been substantially
investigated in machine learning. Conditional Random Fields
(CRF) [?, ?, ?, ?, ?, ?] and structured Support Vector Machine
(SVM) [?, ?, ?] are good examples. Recently, structured SVM
has been used to perform initial phoneme recognition by
learning the relationships between the acoustic vector sequence
and the phoneme label sequence of the whole utterance jointly,
rather than on the frame level or from different sets of
knowledge sources [?], utilizing the nice properties of SVM [?]
to classify the structured patterns of the utterance with
maximized margin. However, both CRF and structured SVM are
linear, and therefore limited in analyzing speech signals.
Another work [?] integrates DNN into structured learning, but
is mainly based on Weighted Finite-State Transducers (WFST).

In this paper, we extend the above structured SVM approach to
phoneme recognition using a structured DNN that includes
nonlinear units in multiple layers, but similarly learns the
global mapping relationships from an acoustic vector sequence
to a phoneme label sequence for a whole utterance. It is
therefore a Structured Deep Neural Network (Structured DNN).

The rest of the paper is organized as follows: Section 2
presents the overall system architecture, Section 3 introduces
the structured feature vector, Section 4 describes the
experimental setup, Section 5 shows the experimental results,
and Section 6 gives the conclusion and future work.

2. Proposed Approach - Structured Deep 
Neural Network 

The whole picture of the concept of the structured DNN for
phoneme recognition is in Fig. 1. Given an utterance with an
acoustic vector sequence $\mathbf{x}$ and a corresponding
phoneme label sequence $\mathbf{y}$, we can first obtain a
structured feature vector $\Psi(\mathbf{x}, \mathbf{y})$
representing $\mathbf{x}$ and $\mathbf{y}$ and the relationships
between them as in Fig. 1(a) (details of
$\Psi(\mathbf{x}, \mathbf{y})$ are given in Section 3), and then
feed it into either an SVM as in Fig. 1(b) or a DNN as in
Fig. 1(c) to get a score by a scoring function
$F_1(\mathbf{x}, \mathbf{y}; \theta_1)$ or
$F_2(\mathbf{x}, \mathbf{y}; \theta_2)$, where $\theta_1$ and
$\theta_2$ are the parameter sets for the SVM and DNN
respectively. Because both $\mathbf{x}$ and $\mathbf{y}$
represent the entire utterance by a structure (sequence) and
either SVM or DNN learns to map the pair
$(\mathbf{x}, \mathbf{y})$ to a score on the utterance level
globally rather than on the frame level, this is structured
learning optimized on the utterance level.

2.1. Structured Learning Concepts 

In structured learning, both the desired outputs $\mathbf{y}_i$
and the input objects $\mathbf{x}_i$ can be sequences, trees,
lattices, or graphs, rather than simply classes or real
numbers. In the context of supervised learning for phoneme
recognition of utterances, we are given a set of training
utterances,
$(\mathbf{x}_1, \mathbf{y}_1), \dots, (\mathbf{x}_N, \mathbf{y}_N) \in X \times Y$,
where $\mathbf{x}_i$ is the acoustic vector sequence of the
$i$-th utterance and $\mathbf{y}_i$ the corresponding reference
phoneme label sequence, and we wish to assign correct phoneme
label sequences to unknown utterances.

Figure 1: The concept of Structured SVM and Structured Deep
Neural Network: (a) the structured feature vector
$\Psi(\mathbf{x}, \mathbf{y})$ for an utterance, (b) structured
SVM and (c) structured DNN.

We first define a function
$f(\mathbf{x}; \theta) = \mathbf{y} : X \to Y$, mapping each
acoustic vector sequence $\mathbf{x}$ to a phoneme label
sequence $\mathbf{y}$, where $\theta$ is the parameter set to be
learned. One way to achieve this is to assign every possible
phoneme label sequence $\mathbf{y}$ given an acoustic vector
sequence $\mathbf{x}$ a score by a scoring function
$F(\mathbf{x}, \mathbf{y}; \theta) : X \times Y \to \mathbb{R}$,
and take the phoneme label sequence $\mathbf{y}$ giving the
highest score as the output of $f(\mathbf{x}; \theta)$,

$$f(\mathbf{x}; \theta) = \arg\max_{\mathbf{y} \in Y} F(\mathbf{x}, \mathbf{y}; \theta). \qquad (1)$$
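As an illustration of the argmax in (1), the following is a minimal sketch that enumerates every candidate label sequence and keeps the highest-scoring one. The scorer `toy_score` and the string-valued "acoustic vectors" are hypothetical stand-ins, not part of the paper; the point is only that the search space has $K^M$ elements.

```python
from itertools import product

def brute_force_decode(x, phonemes, score):
    """Return the label sequence maximising score(x, y) over all K^M candidates."""
    best_y, best_score = None, float("-inf")
    for y in product(phonemes, repeat=len(x)):
        s = score(x, y)
        if s > best_score:
            best_y, best_score = y, s
    return best_y, best_score

# Hypothetical toy scorer (an assumption for illustration only):
# it counts the frames whose label equals a per-frame "hint".
def toy_score(x, y):
    return sum(1.0 for frame, label in zip(x, y) if frame == label)

x = ["A", "B", "B", "C"]               # stand-in for an acoustic vector sequence
best_y, best_score = brute_force_decode(x, "ABC", toy_score)
num_candidates = len("ABC") ** len(x)  # K^M = 3^4 candidates were scored
```

Even for this toy case 81 sequences are scored; with 48 phonemes and hundreds of frames the enumeration is out of reach, which is what motivates the approximate inference described later.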

2.2. Structured SVM 

Based on the maximized margin concept of SVM, we wish to
maximize not only the score of the correct label sequence, but
also the margin between the score of the correct label sequence
and those of the nearest incorrect label sequences, and we
require the scoring function
$F_1(\mathbf{x}, \mathbf{y}; \theta_1)$ to be linear,

$$F_1(\mathbf{x}, \mathbf{y}; \theta_1) = \langle \theta_1, \Psi(\mathbf{x}, \mathbf{y}) \rangle, \qquad (2)$$

where $\Psi(\mathbf{x}, \mathbf{y})$ is the structured feature
vector mentioned above and shown in Figure 1, representing the
structured relationship between $\mathbf{x}$ and $\mathbf{y}$,
$\theta_1$ is in vector form and $\langle \cdot, \cdot \rangle$
represents the inner product. We can then train the parameter
vector $\theta_1$ using training instances
$\{(\mathbf{x}_i, \mathbf{y}_i), i = 1, 2, \dots, N\}$, and then
find the desired label sequence $\mathbf{y}$ for the acoustic
vector sequence $\mathbf{x}$ of any unknown testing utterance
using the scoring function
$F_1(\mathbf{x}, \mathbf{y}; \theta_1)$ with the trained
parameter set $\theta_1$. This problem can be solved with the
well known SVM [?], and is referred to as structured SVM as in
Fig. 1(b).

2.3. Structured Deep Neural Network (Structured DNN) 

The assumption of a linear scoring function as in (2) limits
structured SVM. Instead, the proposed structured DNN uses a
series of nonlinear transforms to build the scoring function
$F_2(\mathbf{x}, \mathbf{y}; \theta_2)$ with $L$ hidden layers
to evaluate a single output value
$F_2(\mathbf{x}, \mathbf{y}; \theta_2)$ as in Fig. 1(c),

$$\begin{aligned}
\mathbf{h}_1 &= \sigma(W_0 \cdot \Psi(\mathbf{x}, \mathbf{y})), \\
\mathbf{h}_l &= \sigma(W_{l-1} \cdot \mathbf{h}_{l-1}), \quad 2 \le l \le L, \\
F_2(\mathbf{x}, \mathbf{y}; \theta_2) &= \sigma(W_L \cdot \mathbf{h}_L),
\end{aligned} \qquad (3)$$

where $W_l$ is the weight matrix (including the bias) of layer
$l$, $\sigma(\cdot)$ a nonlinear transform (sigmoid is used),
$\mathbf{h}_l$ the output vector of hidden layer $l$, and the
set of all DNN parameters $\{W_0, W_1, W_2, \dots, W_L\}$ is
$\theta_2$. Note that the last weight matrix $W_L$ is a vector,
because this DNN gives only a single value as output.
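The forward computation in (3) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the bias convention (last column of each weight matrix, applied to an appended constant 1) and the tiny weights `W0` and `W1` are assumptions made here, not values from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine_sigmoid(W, v):
    """One layer: sigma(W . [v; 1]). Assumed convention: last column of W is the bias."""
    v1 = list(v) + [1.0]
    return [sigmoid(sum(w * a for w, a in zip(row, v1))) for row in W]

def structured_dnn_score(psi, weights):
    """Evaluate F_2 as in (3). weights = [W_0, ..., W_L]; W_L has a single row."""
    h = psi
    for W in weights:
        h = affine_sigmoid(W, h)
    return h[0]

# Hypothetical tiny network: 3-dimensional Psi, one hidden layer of 2 units (L = 1).
W0 = [[0.5, -0.2, 0.1, 0.0],
      [0.3, 0.8, -0.5, 0.1]]
W1 = [[1.0, -1.0, 0.0]]
score = structured_dnn_score([0.2, 0.4, 0.6], [W0, W1])

# With all-zero weights every sigmoid outputs 0.5, so the score is 0.5.
zero_score = structured_dnn_score([1.0, 2.0], [[[0.0, 0.0, 0.0]], [[0.0, 0.0]]])
```

Because the output unit is a sigmoid, the score always lies strictly between 0 and 1, which is what allows it to be used inside a logarithm in the training objective.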

For an utterance with acoustic vector sequence
$\mathbf{x} = (\mathbf{x}^1, \mathbf{x}^2, \dots, \mathbf{x}^M)$,
where $\mathbf{x}^j$ is the $j$-th acoustic vector, there can be
many possible phoneme label sequences
$\mathbf{y} = (y^1, y^2, \dots, y^M)$, where $y^j$ is the
phoneme label for $\mathbf{x}^j$, and a reference phoneme label
sequence $\mathbf{t} = (t^1, t^2, \dots, t^M)$, where $t^j$ is
the true phoneme label for $\mathbf{x}^j$. The label accuracy
function $C_{\mathbf{x}}(\mathbf{t}, \mathbf{y})$ for the
utterance $\mathbf{x}$ can then be calculated as follows,

$$C_{\mathbf{x}}(\mathbf{t}, \mathbf{y}) = \frac{1}{M} \sum_{j=1}^{M} \delta(t^j, y^j), \qquad (4)$$

where $\delta(t^j, y^j)$ is 1 if $t^j = y^j$, and 0 otherwise.
$C_{\mathbf{x}}(\mathbf{t}, \mathbf{y})$ in (4) is actually the
frame accuracy, or one minus the frame error rate. When we are
more interested in minimizing the phone error rate, the
definition of the label accuracy function
$C_{\mathbf{x}}(\mathbf{t}, \mathbf{y})$ in (4) can be modified
to reflect that goal. In both cases, the parameter set
$\theta_2$ of this DNN is trained by minimizing the following
loss function,

$$L(\theta_2) = -\sum_{\mathbf{x}} C_{\mathbf{x}}(\mathbf{t}, \mathbf{y}) \log F_2(\mathbf{x}, \mathbf{y}; \theta_2), \qquad (5)$$

where $L(\theta_2)$ in (5) is summed over all training
utterances, and this objective function is defined in a way
similar to the cross entropy popularly used in DNN training.
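To make (4) and (5) concrete, here is a small sketch that computes the frame-level label accuracy and the resulting loss for a handful of candidate label sequences. The triples in `examples` and the scores 0.9, 0.5 and 0.2 are invented for illustration; in practice the third entry would be the structured DNN output for the pair $(\mathbf{x}, \mathbf{y})$.

```python
import math

def label_accuracy(t, y):
    """C_x(t, y) in (4): fraction of frames where the two label sequences agree."""
    if len(t) != len(y):
        raise ValueError("reference and hypothesis must have the same length")
    return sum(1 for a, b in zip(t, y) if a == b) / len(t)

def structured_loss(examples):
    """L(theta_2) in (5): minus the sum of C_x(t, y) * log F_2(x, y; theta_2).

    `examples` is a list of (t, y, f2_score) triples, where f2_score in (0, 1)
    stands for the DNN output for the pair (x, y).
    """
    return -sum(label_accuracy(t, y) * math.log(f2) for t, y, f2 in examples)

t = ["A", "B", "B", "C"]
examples = [
    (t, ["A", "B", "B", "C"], 0.9),  # reference itself: accuracy 1.0
    (t, ["A", "A", "B", "C"], 0.5),  # one wrong frame: accuracy 0.75
    (t, ["C", "C", "C", "A"], 0.2),  # every frame wrong: accuracy 0.0
]
loss = structured_loss(examples)
```

Each term is weighted by the accuracy of its label sequence, so a sequence with accuracy 0 contributes nothing to this sum, while accurate sequences are pushed towards scores close to 1.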

2.4. Inference with Structured DNN 

With the structured DNN trained as above, given the acoustic
vector sequence $\mathbf{x}$ of an unknown utterance, we need to
find the best phoneme label sequence $\mathbf{y}$ for it. For
structured SVM in Subsection 2.2, due to the linear assumption,
the learned model parameter $\theta_1$ contains enough
information to execute the Viterbi algorithm to find the best
label sequence. This is not true for structured DNN. From (1),
in principle we need to search over all possible phoneme label
sequences ($K^M$ for $K$ phonemes and $M$ acoustic vectors) for
the given acoustic vector sequence and pick the one giving the
highest score, which is computationally infeasible.

Instead of searching through all possible phoneme label
sequences, we can start from a random label sequence, and then
change one phoneme label at a time by going through all
phoneme labels with all other phoneme labels in the utterance
fixed. We iterate over the phoneme label sequence in this way
until the sequence converges. This is referred to as "without
lattice". We can also decode using WFST first to generate a
lattice, and then choose the phone label sequence from the
lattice which gives the highest score. Of course, in this way
the performance is bounded by the quality of the lattice. This
is referred to as "with lattice".
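The two inference modes can be sketched as follows, under stated assumptions: `score` is any callable standing in for $F_2$, the lattice is represented simply as a list of candidate paths, and `max_sweeps`, `seed` and `toy_score` are hypothetical names introduced here for illustration.

```python
import random

def decode_without_lattice(x, phonemes, score, max_sweeps=20, seed=0):
    """Greedy search: relabel one frame at a time until no change improves the score."""
    rng = random.Random(seed)
    y = [rng.choice(phonemes) for _ in x]   # random starting label sequence
    best = score(x, y)
    for _ in range(max_sweeps):
        changed = False
        for j in range(len(y)):
            for p in phonemes:
                if p == y[j]:
                    continue
                cand = y[:j] + [p] + y[j + 1:]   # all other labels stay fixed
                s = score(x, cand)
                if s > best:
                    y, best, changed = cand, s, True
        if not changed:                      # converged (possibly to a local maximum)
            break
    return y, best

def decode_with_lattice(x, lattice_paths, score):
    """Rescore candidate paths (e.g. an N-best list) and keep the highest-scoring one."""
    return max(lattice_paths, key=lambda y: score(x, y))

# Hypothetical toy scorer: counts frames whose label equals a per-frame "hint".
def toy_score(x, y):
    return sum(1.0 for frame, label in zip(x, y) if frame == label)

x = ["A", "B", "B", "C"]
y_greedy, s_greedy = decode_without_lattice(x, ["A", "B", "C"], toy_score)
paths = [["A", "A", "B", "C"], ["A", "A", "C", "C"], ["B", "B", "B", "B"]]
y_lattice = decode_with_lattice(x, paths, toy_score)
```

The toy scorer decomposes over frames, so the greedy search reaches the global optimum here; a real structured DNN score does not decompose this way, and the search may stop at a local maximum. The lattice variant can never return a sequence that is absent from `paths`, which is the sense in which it is bounded by lattice quality.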

2.5. Training of Structured DNN 

For each training utterance, again we have $K^M$ possible label
sequences, so it is impossible to train over all of them. In
structured SVM, there is a large margin training algorithm
which finds the training examples that produce the maximum
margin. For the structured DNN here, how to find and choose
effective training examples is important. Besides the positive
examples (reference phoneme label sequences for the training
utterances), in this work negative examples (those other than
reference label sequences) are chosen both at random and by
inference using the current model. The latter is explained
below.

The "inferred label sequences" represent a feedback mechanism.
When the current structured DNN model is used to decode a
training utterance and the phoneme label sequence obtained is
far from correct, we add this label sequence to the training
data to help adjust the model. With the training data
generated, the standard backpropagation algorithm can be used
to update the structured DNN parameters, and additional
training sequences can be regenerated in each epoch. Because
inference can be performed with or without lattice, the same
holds for training. For training with lattice, we choose the
N-best paths and N random paths from the lattice as the
inferred label sequences.
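One way to picture how the training examples for a single utterance are assembled is the sketch below. It is a simplification under assumptions: `decode` is any callable returning the current model's best label sequence, the inferred sequence is added whenever it differs from the reference (the text above only adds it when it is far from correct), and `num_random` and `seed` are hypothetical parameters.

```python
import random

def build_training_examples(x, reference, phonemes, decode, num_random=2, seed=0):
    """Collect (label_sequence, is_reference) pairs for one training utterance."""
    rng = random.Random(seed)
    ref = list(reference)
    examples = [(ref, True)]                      # positive example
    for _ in range(num_random):                   # negatives chosen at random
        y = [rng.choice(phonemes) for _ in ref]
        if y != ref:
            examples.append((y, False))
    inferred = list(decode(x))                    # negative from the current model
    if inferred != ref:
        examples.append((inferred, False))
    return examples

x = ["A", "B", "B", "C"]                          # stand-in acoustic sequence
reference = ["A", "B", "B", "C"]
current_model = lambda _x: ["A", "A", "B", "C"]   # hypothetical imperfect decoder
examples = build_training_examples(x, reference, ["A", "B", "C"], current_model)
```

Calling this once per epoch with the updated model regenerates the inferred negatives, mirroring the feedback loop described above.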

3. Structured Feature Vector $\Psi(\mathbf{x}, \mathbf{y})$ for an
Utterance

Take the MFCC vectors or phoneme posteriorgram vectors as the
acoustic vectors for an utterance of $M$ frames,
$\mathbf{x} = \{\mathbf{x}^j, j = 1, 2, \dots, M\}$, where the
phoneme label for $\mathbf{x}^j$ is $y^j$. The task is then to
decode $\mathbf{x}$ into the label sequence
$\mathbf{y} = \{y^j, j = 1, 2, \dots, M\}$. Since the most
successful and well known solution to this problem is the HMM,
we try to encode what the HMM has been doing into the feature
vector $\Psi(\mathbf{x}, \mathbf{y})$ to be used here. An HMM
consists of a series of states and two most important sets of
parameters: the transition probabilities between states, and
the observation probability distribution for each state. Such a
structure is slightly complicated for the work here, so in this
preliminary work we use a simplified HMM with only one state
for each phoneme. With this simplification, these two sets of
probabilistic parameters can be estimated for each utterance by
adding up all the counts of the transitions between labels (or
states), adding up all the acoustic vectors for each label
(phoneme or state), and then normalizing the results by the
length of the utterance. This is shown in Fig. 2(a).

Assuming $K$ is the total number of different phonemes, we
first define a $K$-dimensional vector $\Lambda(y^j)$ for $y^j$
with its $k$-th component being 1 and all other components
being 0 if $y^j$ is the $k$-th phoneme. The tensor product
$\otimes$ is helpful here, which is defined as

$$\otimes : \mathbb{R}^P \times \mathbb{R}^Q \to \mathbb{R}^{PQ}, \quad (\mathbf{a} \otimes \mathbf{b})_{i+(j-1)P} = a_i \times b_j, \qquad (6)$$

where $\mathbf{a}$ and $\mathbf{b}$ are two ordinary vectors
with dimensions $P$ and $Q$ respectively. The right half of (6)
says that $\mathbf{a} \otimes \mathbf{b}$ is a vector of
dimension $PQ$, whose $[i + (j - 1)P]$-th component is the
$i$-th component of $\mathbf{a}$ multiplied by the $j$-th
component of $\mathbf{b}$. With this expression, the feature
vector $\Psi(\mathbf{x}, \mathbf{y})$ in Fig. 1(a) to be used
for evaluating the scoring function
$F_1(\mathbf{x}, \mathbf{y}; \theta_1)$ in (2) or
$F_2(\mathbf{x}, \mathbf{y}; \theta_2)$ in (3) can then be
configured as the concatenation of two vectors, as given in (7).



Figure 2: A simplified example of feature sequence
$\mathbf{x} = (\mathbf{x}^1, \mathbf{x}^2, \mathbf{x}^3, \mathbf{x}^4)$
and label sequence
$\mathbf{y} = (y^1, y^2, y^3, y^4) = (A, B, B, C)$: (a) a simple
example with arbitrary acoustic vectors, (b) a demonstration of
how $\Psi(\mathbf{x}, \mathbf{y})$ (not normalized yet) is
computed.


In (7), $\mathbf{x} = \{\mathbf{x}^1, \mathbf{x}^2, \dots, \mathbf{x}^M\}$
and $\mathbf{y} = \{y^1, y^2, \dots, y^M\}$. The upper half of
the right-hand side of (7) accumulates the distribution of all
components of $\mathbf{x}^j$ for each phoneme in the acoustic
vector sequence $\mathbf{x}$, and locates them in different
sections of components of the feature vector
$\Psi(\mathbf{x}, \mathbf{y})$ (corresponding to the
observation probability distribution for each state or phoneme
label estimated with the utterance). The lower half of the
right-hand side of (7), on the other hand, accumulates the
transition counts between each pair of labels (phonemes or
states) in the label sequence $\mathbf{y}$ (corresponding to
the state transition probabilities estimated for the
utterance). After normalization by the utterance length $M$
(which also helps to give a good range for the input of the
DNN), $\Psi(\mathbf{x}, \mathbf{y})$ is the concatenation of
the two, so it keeps the primary statistical parameters of
$\mathbf{x}^j$ for different phonemes $y^j$ for all
$\mathbf{x}^j$ in $\mathbf{x}$, and the transitions between
states for all $y^j$ in $\mathbf{y}$. With enough training
utterances $(\mathbf{x}, \mathbf{y})$ and the corresponding
$\Psi(\mathbf{x}, \mathbf{y})$, we can then learn the scoring
function $F_1(\mathbf{x}, \mathbf{y}; \theta_1)$ or
$F_2(\mathbf{x}, \mathbf{y}; \theta_2)$ by training the
parameters $\theta_1$ or $\theta_2$. The vector
$\Psi(\mathbf{x}, \mathbf{y})$ in (7) can easily be extended to
higher order Markov assumptions (transition to the next state
depending on more than one previous state). For example, by
replacing the upper half of (7) with
$\sum_{n=1}^{M-1} \mathbf{x}^n \otimes \Lambda(y^n) \otimes \Lambda(y^{n+1})$
and the lower half of (7) with
$\sum_{n=1}^{M-2} \Lambda(y^n) \otimes \Lambda(y^{n+1}) \otimes \Lambda(y^{n+2})$,
we have the second order Markov assumption.

Consider a simplified example with $K = 3$ (only 3 allowed
phonemes A, B, C) and an utterance of length $M = 4$, as shown
in Fig. 2(b). It is then easy to find that the upper half of
$\Psi(\mathbf{x}, \mathbf{y})$ is
$\sum_{n=1}^{4} \mathbf{x}^n \otimes \Lambda(y^n) = (1.2, 2.6, 2.7, 2.3, 1.5, 2.5)'$,
and the lower half of $\Psi(\mathbf{x}, \mathbf{y})$ is
$\sum_{n=1}^{3} \Lambda(y^n) \otimes \Lambda(y^{n+1}) = (0, 1, 0, 0, 1, 1, 0, 0, 0)'$.
We therefore have
$\Psi(\mathbf{x}, \mathbf{y}) = \frac{1}{4} \cdot (1.2, 2.6, 2.7, 2.3, 1.5, 2.5, 0, 1, 0, 0, 1, 1, 0, 0, 0)'$.
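The worked example can be reproduced with a short sketch. Two things here are assumptions rather than facts from the text: the individual vectors `x[1]` and `x[2]` are hypothetical (only their sum, (2.7, 2.3), is fixed by the printed result), and the transition block is laid out with the "from" label varying slowest, which is the ordering that matches the printed vector. Applying the index convention of (6) literally to $\Lambda(y^n) \otimes \Lambda(y^{n+1})$ would give the same counts in transposed order.

```python
def tensor_product(a, b):
    """Flatten a (x) b as in (6): 1-based component i + (j-1)P equals a_i * b_j."""
    return [ai * bj for bj in b for ai in a]

def one_hot(label, phonemes):
    """Lambda(y): indicator vector of `label` within the phoneme inventory."""
    return [1.0 if p == label else 0.0 for p in phonemes]

def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

def feature_vector(x, y, phonemes):
    """Psi(x, y): per-phoneme observation sums and transition counts, scaled by 1/M."""
    K, D, M = len(phonemes), len(x[0]), len(x)
    upper = [0.0] * (D * K)
    for xj, yj in zip(x, y):
        upper = add(upper, tensor_product(xj, one_hot(yj, phonemes)))
    lower = [0.0] * (K * K)
    for prev, nxt in zip(y, y[1:]):
        # Operands ordered so that the "from" label varies slowest,
        # which is the layout of the printed example (an assumption).
        lower = add(lower, tensor_product(one_hot(nxt, phonemes),
                                          one_hot(prev, phonemes)))
    return [v / M for v in upper + lower]

# x[0] and x[3] follow from the printed result; x[1] and x[2] are hypothetical
# values chosen only so that they sum to (2.7, 2.3).
x = [[1.2, 2.6], [1.3, 0.9], [1.4, 1.4], [1.5, 2.5]]
y = ["A", "B", "B", "C"]
psi = feature_vector(x, y, ["A", "B", "C"])
```

The resulting `psi` has 2 x 3 + 3 x 3 = 15 components; multiplying it by M = 4 recovers the unnormalized vector of the example up to floating-point rounding.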

4. Experimental Setup 

Initial experiments were performed with TIMIT. We used the
training set without dialect sentences for training and the core
testing set (with 24 speakers and no dialect sentences) for
testing. The models were trained with a set of 48 phonemes and
tested with a set of 39 phonemes, conforming to CMU/MIT
standards [?]. We used an online library [?] for structured
SVM, and modified the Kaldi [?] code to implement structured
DNN.


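$$\Psi(\mathbf{x}, \mathbf{y}) = \frac{1}{M} \begin{pmatrix} \sum_{j=1}^{M} \mathbf{x}^j \otimes \Lambda(y^j) \\ \sum_{j=1}^{M-1} \Lambda(y^j) \otimes \Lambda(y^{j+1}) \end{pmatrix}, \qquad (7)$$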


Our experiments are based on Karel's recipe in the Kaldi TIMIT
script, which uses LDA-MLLT-fMLLR features obtained from
auxiliary GMM models, as well as RBM pre-training, frame
cross-entropy training and sMBR. On top of Karel's recipe, we
used three sets of acoustic vectors: (a) LDA-MLLT-fMLLR
features (40 dimensions), i.e., the input to the DNN in Karel's
recipe; (b) phoneme posterior probabilities (48 dimensions)
obtained from the 1943-dimensional DNN output (state
posteriors) of Karel's recipe by reducing the dimension to the
mono-phoneme size of 48 with an additional DNN layer
(1943 x 48); and (c) phoneme posterior probabilities from
filter bank features (48 dimensions) obtained by a DNN
(4 layers of 512 neurons) with filter bank input of context
4-1-4. They are respectively referred to as acoustic vectors
(a), (b) and (c) below.

Acoustic vectors              structured  structured DNN     structured DNN  Karel's
                              SVM         (without lattice)  (with lattice)  recipe
------------------------------------------------------------------------------------
(a) LDA-MLLT-fMLLR            38.62       90.50              19.98           18.90
(b) phone post (kaldi)        24.32       30.62              18.77           X
(c) phone post (filter bank)  30.84       40.65              X               X

Table 1: Phone Error Rate (%) evaluated on 39 phonemes with
different acoustic vectors and different algorithms: structured
SVM, structured DNN without lattice, structured DNN with
lattice (top 2000-best paths), and state-of-the-art Kaldi
results.

5. Experimental Results 

The results are listed in Table 1. The structured DNN without
lattice (3 hidden layers, 200 neurons per layer) did not
perform better than structured SVM, although it did learn some
structured patterns. This is apparently because of the poor
quality of the training data (most of the examples are random),
as well as the fact that the inference algorithm of changing
only one phoneme label at a time is prone to converging to
local maxima. On the other hand, training/inference with
lattice was much better, because it offered many effective
training examples by picking the top N paths from the lattice.
This structured DNN with lattice gave a phone error rate (PER)
of 18.77%, which outperformed structured SVM and is slightly
better than Karel's recipe in Kaldi at 18.90% in Table 1. Note
that in the experiments here, as explained in Section 3, we
simply assume a single state for each phoneme in (7), which is
certainly over-simplified.



Figure 3: Phone Error Rate (%) map for the proposed structured
DNN with lattice, N = 500, using acoustic vector (b), for
different values of L (number of hidden layers) and M (number
of neurons per layer).


We further used the best acoustic vectors (b) of Table 1 and
changed the number N of N-best paths taken from the lattice in
training/inference; the results are in Table 2. When the
reference phone label sequence is known, we can choose, out of
the N-best paths, the path closest to the reference, the path
most different from the reference, or a random path. These give
the oracle min, oracle max, and random results in the first
three rows of Table 2. For N = 500, 1000 or 2000, the
structured DNN (SDNN) row is always better than random
selection from the N-best paths, but with a gap from oracle
min. This indicates that the structured DNN did learn some
structure information. Of course, the oracle min here is a
lower bound on the PER of the structured DNN with lattice
proposed here. The last row shows the difference between random
and structured DNN, which increases as the training data (N)
grows, suggesting that a larger data set provides better
results.
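The oracle min, oracle max and random selections can be sketched as below. How closeness to the reference is measured is not stated above; this sketch assumes the Levenshtein (edit) distance between label sequences, and the N-best list `nbest` is a made-up toy example.

```python
import random

def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def oracle_selection(reference, nbest, seed=0):
    """Pick the closest, the farthest and a random path from an N-best list."""
    ranked = sorted(nbest, key=lambda y: edit_distance(reference, y))
    return {"oracle_min": ranked[0],
            "oracle_max": ranked[-1],
            "random": random.Random(seed).choice(nbest)}

reference = list("ABBC")
nbest = [list("ABBB"), list("AC"), list("CCCC")]   # hypothetical 3-best list
picked = oracle_selection(reference, nbest)
```

Since a rescoring model can only return one of the paths in `nbest`, its error can never fall below that of the `oracle_min` path, which is why the oracle min row acts as a bound.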

The next experiment analyzes the phone error rates (PER) for
different choices of the key hyper-parameters of the structured
DNN: L, the number of hidden layers, and M, the number of
neurons in each hidden layer. Figure 3 shows the result, a
visualized PER map for structured DNN with lattice, N = 500,
using acoustic vector (b). The horizontal axis is M, where
M = 100, 200, ..., 1000, and the vertical axis is L, where
L = 1, 2, ..., 6. The figure therefore consists of 6 x 10 = 60
data points. The overall PER is approximately 18% to 21%, in
all cases outperforming the structured SVM and more or less
comparable to Karel's recipe. For this task, better PER tends
to be found with fewer hidden layers and fewer neurons. If M is
small, we can use deeper networks with up to 4 hidden layers,
whereas if M is large, 2 hidden layers are enough. With larger
L and larger M, the DNN overfits the training data. The best
PER is at (L, M) = (1, 500), which was the case in Table 1.

6. Conclusion and Future Work 

In this paper, we propose a new structured learning
architecture, structured DNN for phoneme recognition, which
jointly considers the structures of acoustic vector sequences
and phoneme label sequences globally. Preliminary test results
show that the structured DNN outperformed the previously
proposed structured SVM and provided a result comparable to the
state of the art. In the future, we will work on multiple
states per phone and explore further possibilities of
structured DNN.



             N = 500  N = 1000  N = 2000
----------------------------------------
oracle min   11.25    10.67     10.10
oracle max   29.52    30.70     31.82
random       20.88    21.16     21.44
SDNN         18.91    19.19     18.77
rand - SDNN  1.97     1.97      2.07

Table 2: Phone Error Rate (%) for structured DNN with lattice
for different N of the N-best paths from the lattice, with
acoustic vectors (b) in Table 1.